batch job queue: adds initial empty implementation #27841

Open

mismithhisler wants to merge 12 commits into f-batch-job-queue from f-bjq-initial-queue-impl
Conversation

@mismithhisler
Member

Description

These changes add an initial draft implementation of an empty batch job queue, including the queue's core data structures and the ability to watch evals for job placement. To facilitate easier reviews, much of the implementation has been left for further PRs.

Testing & Reproduction steps

Links

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad product documentation, which is stored in the
    web-unified-docs repo. Refer to the web-unified-docs contributor guide for docs guidelines.
    Please also consider whether the change requires notes within the upgrade guide. If you
    would like help with the docs, tag the nomad-docs team in this PR.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.
  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

@mismithhisler mismithhisler self-assigned this Apr 16, 2026
@mismithhisler mismithhisler changed the base branch from main to f-batch-job-queue April 16, 2026 18:39
@mismithhisler mismithhisler marked this pull request as ready for review May 4, 2026 20:26
@mismithhisler mismithhisler requested review from a team as code owners May 4, 2026 20:26
Comment thread nomad/queues/batch_job_queue.go Outdated
Member

@tgross tgross left a comment


This is great work @mismithhisler. The only potentially blocking concern for me here is the waitForPlacement blocking forever.

Comment thread nomad/queues/priority_queue.go Outdated
old := *pq
n := len(old)
item := old[n-1]
old[n-1] = nil // don't stop the GC from reclaiming the item eventually
Member


Do we actually need this? I'm pretty sure the old slice goes out of scope at the end of this function and the *pq = old[0:n-1] is copying its (ptr, len, cap), not its contents. https://go.dev/play/p/xE0glBdR8O6

Maybe I'm missing something?

Member Author


Admittedly I took this directly from the container/heap priorityQueue example here.

I'll take a deeper look at this today.

Member


I did some quick digging and found golang/go#65403 and golang/go#65404 which leads to this Gerrit discussion https://go-review.googlesource.com/c/go/+/559775

I guess I'm wrong?

Member


https://go.dev/play/p/7hR-IWTT7J9 shows it: the old pointer still lives in the backing array. If you uncomment the old[n-1] = nil and re-run this, it'll show as 0x0 (nil)

Member Author


Yeah, I was wondering if this has to do with the fact that the capacity of the slice stays the same, so the backing array is probably still holding onto that pointer, even though it doesn't appear so.
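For what it's worth, the interaction discussed above can be sketched standalone (the item type below is a stand-in, not the PR's actual workload type): re-slicing only shrinks the slice header's length, so the backing array still holds the popped pointer unless the slot is explicitly cleared, which is why the container/heap example nils it out.

```go
package main

import "fmt"

// item stands in for the pointer values stored in the priority queue.
type item struct{ id int }

// pop mirrors the container/heap Pop example. Setting old[n-1] = nil
// matters because *pq = old[:n-1] only shrinks the slice header; the
// backing array keeps the old pointer in its last slot, which would
// keep the item reachable by the GC until that slot is overwritten.
func pop(pq *[]*item) *item {
	old := *pq
	n := len(old)
	it := old[n-1]
	old[n-1] = nil // clear the slot so the GC can reclaim the item
	*pq = old[:n-1]
	return it
}

func main() {
	pq := []*item{{id: 1}, {id: 2}, {id: 3}}
	it := pop(&pq)
	fmt.Println(it.id, len(pq), cap(pq)) // 3 2 3: cap is unchanged
	// The slot beyond len(pq) in the backing array was explicitly cleared:
	fmt.Println(pq[:cap(pq)][2] == nil) // true
}
```

Commenting out the `old[n-1] = nil` line and re-slicing with `pq[:cap(pq)]` shows the original pointer still sitting in the backing array, matching the playground link above.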

Comment thread nomad/queues/batch_job_queue_test.go Outdated
// to an internal channel to be processed and added to the actual
// heap container.
func (d *DynamicPriorityQueue) Enqueue(e *structs.Evaluation) {
w := d.generateWorkload(e)
Member


I realize this isn't wired-up yet but do we imagine we'll return the empty workload here if this is a non-batch job, or will we just not call Enqueue for those in the first place?

Member Author


I haven't completely figured out the best way to "route" these evaluations yet. I was thinking that only batch jobs would be routed to this queue, and then if, for example, they didn't have the required metadata flag (if metadata was set), we would just pass them to the eval broker?

Open to ideas here though.
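The routing idea under discussion might look something like the sketch below. All of the names here (route, the string sink labels, the minimal Evaluation type) are hypothetical stand-ins for illustration, not the PR's actual API; the only grounded detail is the Eval.TriggeredBy == EvalTriggerJobRegister filter mentioned later in the thread.

```go
package main

import "fmt"

// Evaluation is a minimal stand-in for Nomad's structs.Evaluation;
// only the fields relevant to routing are modeled here.
type Evaluation struct {
	Type        string // scheduler type, e.g. "batch" or "service"
	TriggeredBy string
}

const EvalTriggerJobRegister = "job-register"

// route sends batch job-register evals to the batch job queue and
// everything else straight to the eval broker. The sink names are
// illustrative labels, not real components.
func route(e *Evaluation) string {
	if e.Type == "batch" && e.TriggeredBy == EvalTriggerJobRegister {
		return "batch-job-queue"
	}
	return "eval-broker"
}

func main() {
	fmt.Println(route(&Evaluation{Type: "batch", TriggeredBy: EvalTriggerJobRegister}))
	fmt.Println(route(&Evaluation{Type: "service", TriggeredBy: EvalTriggerJobRegister}))
}
```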

Comment thread nomad/queues/batch_job_queue.go Outdated
Comment on lines +166 to +167
// Wait for the eval to be placed
d.waitForPlacement(ctx, workload.eval)
Member


Suppose a job author writes a job that can't ever be placed because they screwed up a constraint. Doesn't this end up blocking forever and preventing any further jobs from being enqueued to the eval broker? Do we need some way of "abandoning" a workload in this queue or otherwise saving it to be retried later?

Also, we don't do anything with the error returned from this.

Member Author


Yeah, at the moment this would block forever until the job was stopped and the eval was marked complete. We could add some configurable limit to waitForPlacement that stops the blocking query after some period of time has gone by. I'm not sure there's much we can do in the way of saving it to be retried later, as once it's released to the eval broker, it's out of our hands.

Yeah I forgot to handle this error, I'll get that updated. 😄

Member


I'm not sure there much we can do in the way of saving it to be retried later, as once it's released to the eval broker, it's now out of our hands.

That's a good point, but it also makes me realize a more fundamental issue: won't any blocked eval also end up being re-submitted to this queue? Which means we'll be waiting here and never enqueueing the blocked eval into the eval broker in the first place. Right now I don't think we ever unblock in the case of a blocked eval. We should probably add a test that covers this workflow.

Member Author

@mismithhisler mismithhisler May 5, 2026


We would filter out any evals that don't have Eval.TriggeredBy == EvalTriggerJobRegister, so we should be good there. It will get a little complex with new versions of jobs, but I'm just trying to get a basic queue in here.

Hopefully next I'll start wiring it up, and working on state restore, which also will be a little complex because of leadership transfers.

Comment thread nomad/queues/batch_job_queue.go
Comment thread nomad/queues/batch_job_queue.go Outdated
Comment on lines +46 to +47
// conf contains user configurations for tuning the behavior of the queue
conf *DynamicPriorityConfig
Member


Definitely a TBD, but do we think we're just going to blow away the whole queue if this configuration gets changed via API?

Member Author


Yeah, I think just getting a solid "restore the queue from state" functionality and relying on it for any conf changes is probably the best way to go, at least for now.

Comment on lines +59 to +63
select {
case <-doneCh:
t.Fatal("should not have exited")
default:
}
Member


Strictly speaking this doesn't reliably exercise the desired behavior: waitForPlacement could still be on its first pass through the loop and not have had an opportunity to incorrectly return. If waitForPlacement were buggy this would be flaky rather than always fail.

I think the next subtest has the same problem?

Member Author


Yeah you're absolutely right, I'll look into a better way to write this test.

Member Author


Added a wait for the new test watchset, which guarantees that the goroutine has begun its actual blocking query before we upsert an eval update.

Co-authored-by: Tim Gross <tim@0x74696d.com>